%%HTML
<script src="require.js"></script>
The number of restaurants in New York is increasing day by day. Many students and busy professionals rely on these restaurants because of their hectic lifestyles, and online food delivery is a convenient option that brings them good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
The food aggregator company has stored the data for the orders made by registered customers in its online portal. The company wants to analyze this data to understand the demand for different restaurants, which will help it enhance the customer experience. Suppose you are hired as a Data Scientist at this company and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.
The dataset contains several fields describing each food order. The detailed data dictionary is given below.
import plotly.io as pio
pio.renderers.default='notebook'
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
!pip install plotly
import plotly.express as px
import plotly.figure_factory as ff
# to restrict the float value to 2 decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# To display the graphs
%matplotlib inline
# giving colab access to drive
from google.colab import drive
drive.mount('/content/drive')
# importing the data
df = pd.read_csv('/content/drive/MyDrive/DS&BA @ UTA/Module 1/Project - FoodHub/foodhub_order.csv')
# returning the first 5 rows to check that the data was correctly imported
df.head()
# looking at the total columns and rows of the df
df.shape
# printing the information summary of the DataFrame
df.info()
# looking for missing values
df.isnull().sum()
# looking at the descriptive overview of the data
df.describe(include='all').T
# looking at the count of each rating value
df['rating'].value_counts()
fig=px.histogram(df, x='cost_of_the_order')
fig.show()
fig=px.box(df, x='cost_of_the_order')
fig.show()
fig=px.histogram(df, x='food_preparation_time')
fig.show()
fig=px.box(df, x='food_preparation_time')
fig.show()
fig=px.histogram(df, x='delivery_time')
fig.show()
fig=px.box(df, x='delivery_time')
fig.show()
fig=px.histogram(df, x='restaurant_name')
fig.show()
# calculating the percentages of orders from each restaurant
df.restaurant_name.value_counts()*100 / len(df)
fig=px.histogram(df, x='cuisine_type')
fig.show()
# calculating the percentages of orders from each type of cuisine
df.cuisine_type.value_counts()*100 / len(df)
fig=px.histogram(df, x='day_of_the_week')
fig.show()
# calculating the percentages of orders depending on the days of the week
df.day_of_the_week.value_counts()*100 / len(df)
fig=px.histogram(df, x='rating')
fig.show()
# calculating the percentages for each type of rating given
df['rating'].value_counts()*100 / len(df)
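The percentage cells above all divide `value_counts()` by `len(df)`. As a possible shortcut, `value_counts(normalize=True)` returns the shares directly. The sketch below uses a small made-up `sample` frame (an assumption for illustration, not the FoodHub data) so it can run on its own.

```python
import pandas as pd

# Hypothetical sample standing in for the FoodHub orders (illustration only)
sample = pd.DataFrame({
    'day_of_the_week': ['Weekend', 'Weekend', 'Weekday', 'Weekend'],
})

# normalize=True returns proportions that sum to 1; multiply by 100 for percentages
day_share = sample['day_of_the_week'].value_counts(normalize=True) * 100
print(day_share)
```

With the real data, the same call would be applied to `df` instead of `sample`.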
# calculating the total number of orders from each restaurant
rest_name = ['restaurant_name']
for column in rest_name:
    print(df[column].value_counts(ascending=False))
# calculating the total number of orders from each type of cuisine
df['cuisine_type'].value_counts(ascending=False)
# calculating the total number of orders during weekends from each type of cuisine
df.loc[df['day_of_the_week']=='Weekend','cuisine_type'].value_counts(ascending=False)
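# counting the orders with a cost over 20
orders_over_twenty_count = df.loc[df['cost_of_the_order']>20].shape[0]
print(orders_over_twenty_count)
# getting the total number of orders
total_orders = df.shape[0]
print(total_orders)
# getting the % of orders over 20
orders_over_twenty=(orders_over_twenty_count/total_orders)*100
print(orders_over_twenty)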
# getting the mean of the delivery time
df['delivery_time'].mean()
# finding the most frequent customers
df['customer_id'].value_counts(ascending=False)
# corr analysis between numerical variables
sns.pairplot(df)
plt.show()
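The pair plot shows the relationships visually. A numeric correlation matrix can complement it. The sketch below is one way to compute it, restricted to three numeric columns; the `sample` values are invented for illustration (and deliberately perfectly linear), so they say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sample (illustration only); the values are deliberately linear
sample = pd.DataFrame({
    'cost_of_the_order': [10.0, 20.0, 30.0, 40.0],
    'food_preparation_time': [20, 25, 30, 35],
    'delivery_time': [30, 25, 20, 15],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = sample.select_dtypes(include='number').corr()
print(corr.round(2))
```

On the real `df`, ID columns such as `customer_id` are numeric as well and would probably need to be left out before calling `corr()`.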
fig=px.box(df,x='cuisine_type',y='cost_of_the_order')
fig.show()
fig=px.scatter(df,x='cuisine_type',y='cost_of_the_order', color='day_of_the_week')
fig.show()
fig=px.box(df,x='cuisine_type',y='food_preparation_time')
fig.show()
fig=px.scatter(df,x='cuisine_type',y='food_preparation_time', color='day_of_the_week')
fig.show()
fig=px.box(df,x='cuisine_type',y='delivery_time')
fig.show()
fig=px.scatter(df,x='cuisine_type',y='delivery_time', color='day_of_the_week')
fig.show()
fig=px.box(df,x='day_of_the_week',y='delivery_time')
fig.show()
fig=px.histogram(df,x='cuisine_type',color='day_of_the_week')
fig.show()
fig=px.histogram(df,x='cuisine_type',color='rating')
fig.show()
# looking at the count of 5 stars ratings given per cuisine
df.loc[df['rating']=='5','cuisine_type'].value_counts(ascending=False)
# looking at the count of 4 stars ratings given per cuisine
df.loc[df['rating']=='4','cuisine_type'].value_counts(ascending=False)
# looking at the count of 3 stars ratings given per cuisine
df.loc[df['rating']=='3','cuisine_type'].value_counts(ascending=False)
# looking at the count of not rated orders per cuisine
df.loc[df['rating']=='Not given','cuisine_type'].value_counts(ascending=False)
fig=px.histogram(df,x='restaurant_name',y='cost_of_the_order')
fig.show()
# looking for the average cost per order from Shake Shack
df.loc[df['restaurant_name']=='Shake Shack','cost_of_the_order'].mean()
fig=px.box(df,x='day_of_the_week',y='cost_of_the_order')
fig.show()
fig=px.box(df,x='rating',y='cost_of_the_order')
fig.show()
# looking at the average cost of orders depending on the rating given
df.groupby(['rating'])[['cost_of_the_order']].mean()
fig=px.histogram(df,x='day_of_the_week',color='rating')
fig.show()
fig=px.box(df,x='rating',y='food_preparation_time')
fig.show()
# looking at the average food preparation time categorized by rating
df.groupby(['rating'])[['food_preparation_time']].mean()
fig=px.box(df,x='rating',y='delivery_time')
fig.show()
# looking at the average delivery time categorized by rating
df.groupby(['rating'])[['delivery_time']].mean()
# creating a new column with the total time of an order, from the time it's placed to when it gets to its destination
df['total_time'] = df['delivery_time'] + df['food_preparation_time']
df['total_time'] = df['total_time'].astype(int)
# making sure the column was added correctly
df.head()
fig=px.box(df,x='rating',y='total_time')
fig.show()
# looking at the average total time categorized by rating
df.groupby(['rating'])[['total_time']].mean()
# looking at the types of 'rating' and their count
df['rating'].apply(type).value_counts()
# replacing 'Not given' values with NaN
df['rating'] = df['rating'].replace(['Not given'],np.nan)
# changing the data type to float
df['rating'] = df['rating'].astype(float)
# returning the different types of cuisines
df['cuisine_type'].unique()
# looking at American restaurants and their rating count
df.loc[df['cuisine_type']=='American'].groupby("restaurant_name")["rating"].count()
# looking at Blue Ribbon Fried Chicken and its average rating
df.loc[df['restaurant_name']=='Blue Ribbon Fried Chicken'].groupby("restaurant_name")["rating"].mean()
# looking at Shake Shack and its average rating
df.loc[df['restaurant_name']=='Shake Shack'].groupby("restaurant_name")["rating"].mean()
The following American restaurants qualify for the promotion:
# looking at Korean restaurants and their rating count
df.loc[df['cuisine_type']=='Korean'].groupby("restaurant_name")["rating"].count()
No Korean restaurants qualify for the promotion.
# looking at Japanese restaurants and their rating count
df.loc[df['cuisine_type']=='Japanese'].groupby("restaurant_name")["rating"].count()
# looking at Blue Ribbon Sushi and its average rating
df.loc[df['restaurant_name']=='Blue Ribbon Sushi'].groupby("restaurant_name")["rating"].mean()
Blue Ribbon Sushi is the only Japanese restaurant that qualifies for the promotion.
# looking at Mexican restaurants and their rating count
df.loc[df['cuisine_type']=='Mexican'].groupby("restaurant_name")["rating"].count()
No Mexican restaurants qualify for the promotion.
# looking at Indian restaurants and their rating count
df.loc[df['cuisine_type']=='Indian'].groupby("restaurant_name")["rating"].count()
No Indian restaurants qualify for the promotion.
# looking at Italian restaurants and their rating count
df.loc[df['cuisine_type']=='Italian'].groupby("restaurant_name")["rating"].count()
# looking at The Meatball Shop and its average rating
df.loc[df['restaurant_name']=='The Meatball Shop'].groupby("restaurant_name")["rating"].mean()
The Meatball Shop is the only Italian restaurant that qualifies for the promotion.
# looking at Mediterranean restaurants and their rating count
df.loc[df['cuisine_type']=='Mediterranean'].groupby("restaurant_name")["rating"].count()
No Mediterranean restaurants qualify for the promotion.
# looking at Chinese restaurants and their rating count
df.loc[df['cuisine_type']=='Chinese'].groupby("restaurant_name")["rating"].count()
No Chinese restaurants qualify for the promotion.
# looking at Middle Eastern restaurants and their rating count
df.loc[df['cuisine_type']=='Middle Eastern'].groupby("restaurant_name")["rating"].count()
No Middle Eastern restaurants qualify for the promotion.
# looking at Thai restaurants and their rating count
df.loc[df['cuisine_type']=='Thai'].groupby("restaurant_name")["rating"].count()
No Thai restaurants qualify for the promotion.
# looking at Southern restaurants and their rating count
df.loc[df['cuisine_type']=='Southern'].groupby("restaurant_name")["rating"].count()
No Southern restaurants qualify for the promotion.
# looking at French restaurants and their rating count
df.loc[df['cuisine_type']=='French'].groupby("restaurant_name")["rating"].count()
No French restaurants qualify for the promotion.
# looking at Spanish restaurants and their rating count
df.loc[df['cuisine_type']=='Spanish'].groupby("restaurant_name")["rating"].count()
No Spanish restaurants qualify for the promotion.
# looking at Vietnamese restaurants and their rating count
df.loc[df['cuisine_type']=='Vietnamese'].groupby("restaurant_name")["rating"].count()
No Vietnamese restaurants qualify for the promotion.
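The per-cuisine cells above repeat the same check: count the ratings per restaurant, then look at the average. A single grouped aggregation could do this in one pass. The promotion thresholds are not stated in this section, so `min_ratings` and `min_avg` below are hypothetical parameters, and `restaurants_meeting_thresholds` and the `sample` frame are illustrative names, not part of the original analysis.

```python
import pandas as pd

def restaurants_meeting_thresholds(orders, min_ratings, min_avg):
    """Return restaurants with more than min_ratings ratings and a mean rating above min_avg."""
    # Coerce non-numeric entries such as 'Not given' to NaN so they are ignored
    ratings = pd.to_numeric(orders['rating'], errors='coerce')
    stats = (orders.assign(rating=ratings)
                   .groupby('restaurant_name')['rating']
                   .agg(['count', 'mean']))
    return stats[(stats['count'] > min_ratings) & (stats['mean'] > min_avg)]

# Hypothetical sample (illustration only)
sample = pd.DataFrame({
    'restaurant_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': ['5', '4', 'Not given', '3', '3', '5'],
})

# Threshold values here are placeholders chosen to suit the tiny sample
qualifying = restaurants_meeting_thresholds(sample, min_ratings=1, min_avg=4)
print(qualifying)
```

The function also works after `rating` has been converted to float, since `pd.to_numeric` leaves numeric values unchanged.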
# Looking at orders over 20 dollars from each week day category
orders_over_twenty_by_day = df.loc[df['cost_of_the_order']>20].groupby("day_of_the_week")["cost_of_the_order"].count()
print(orders_over_twenty_by_day)
# adding up the orders from above to get the total number of orders over 20
total_orders_above_twenty = orders_over_twenty_by_day.sum()
print(total_orders_above_twenty)
# Looking at the sum of orders over 20 from each week day category
cost_over_twenty_by_day = df.loc[df['cost_of_the_order']>20].groupby("day_of_the_week")["cost_of_the_order"].sum()
print(cost_over_twenty_by_day)
# adding up the amounts from above to get the total cost of orders over 20
total_cost_orders_over_twenty = cost_over_twenty_by_day.sum()
print(total_cost_orders_over_twenty)
# calculating the total revenue made from orders over 20
revenue_made_orders_over_twenty = total_cost_orders_over_twenty*.25
print(revenue_made_orders_over_twenty)
# creating function to calculate revenue generated per order
def revenue_generated(*costs):
    """
    This function takes the cost of the orders,
    applies the corresponding margin
    (25% for orders over 20, 15% for orders over 5, otherwise 0),
    and returns the total revenue generated by those margins.
    """
    total_revenue = 0
    for cost in costs:
        if cost > 20:
            total_revenue += cost * .25
        elif cost > 5:
            total_revenue += cost * .15
        else:
            total_revenue += cost * 0
    # return the total revenue
    return total_revenue
# applying the function to the df
df['cost_of_the_order'].apply(revenue_generated)
# adding a new column for the revenue
df['revenue']=df['cost_of_the_order'].apply(revenue_generated)
df.head()
# adding up the revenue column
df['revenue'].sum()
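The revenue rule used above (25% on orders over 20, 15% on orders over 5, nothing otherwise) can also be written without a row-by-row `apply`. The sketch below is one vectorized alternative using `np.select`; the `sample` frame is made-up illustration data, not the FoodHub dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (illustration only), including the boundary values 5 and 20
sample = pd.DataFrame({'cost_of_the_order': [4.0, 5.0, 10.0, 20.0, 30.0]})
cost = sample['cost_of_the_order']

# Conditions are evaluated in order, mirroring the if/elif/else chain:
# the first matching condition decides the rate, and default covers the rest
rate = np.select([cost > 20, cost > 5], [0.25, 0.15], default=0.0)

sample['revenue'] = cost * rate
total_revenue = sample['revenue'].sum()
print(sample)
print(total_revenue)
```

Because both comparisons are strict, an order of exactly 20 falls in the 15% tier and an order of exactly 5 earns nothing, the same as in the function above.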
# getting the percentage of orders that took longer than 60 minutes
(df['total_time']>60).mean()*100
# looking at the total of orders made on weekends
df.loc[df['day_of_the_week']=='Weekend'].groupby("day_of_the_week").count()
There's a total of 1351 orders on weekends.
# returning the sum of all delivery time for orders during weekends
df.loc[df['day_of_the_week']=='Weekend'].groupby("day_of_the_week")['delivery_time'].sum()
# returning the average delivery time for orders during weekends
df.loc[df['day_of_the_week']=='Weekend'].groupby("day_of_the_week")['delivery_time'].mean()
# returning the average delivery time for orders during weekdays
df.loc[df['day_of_the_week']=='Weekday'].groupby("day_of_the_week")['delivery_time'].mean()
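The weekend and weekday cells above filter first and then group by the same column. A single `groupby` over `day_of_the_week` could return the count, sum, and mean for both categories side by side. The `sample` frame below is invented for illustration only.

```python
import pandas as pd

# Hypothetical sample (illustration only)
sample = pd.DataFrame({
    'day_of_the_week': ['Weekend', 'Weekend', 'Weekday', 'Weekday'],
    'delivery_time': [20, 24, 28, 30],
})

# One row per day category, one column per statistic
delivery_by_day = (sample.groupby('day_of_the_week')['delivery_time']
                         .agg(['count', 'sum', 'mean']))
print(delivery_by_day)
```

Applied to `df`, this would give the weekend order count and both average delivery times in one table.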